Detecting Website Redesigns via Template Similarity on Streams of Documents
Author
Abstract
Most websites undergo a redesign from time to time. A change in a site's appearance comes with a different document structure. Hence, redesigns can be detected by observing changes in the structural similarity of monitored HTML documents. If we further assume that what is monitored is not a fixed document set but a series of the newest documents (e.g. as provided by an RSS feed), redesign detection becomes a particular change detection operation on streams of documents. This paper describes and evaluates one simple approach and three more elaborate approaches to the problem. We show that redesigns can be detected automatically, effectively, and efficiently.
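The idea of detecting a redesign as a drop in structural similarity on a document stream can be illustrated with a minimal sketch. The abstract does not say which similarity measure or detection rule the paper uses, so everything below is an assumption made for illustration: each document is reduced to its set of root-to-element tag paths, similarity is the Jaccard coefficient of two such sets, and a redesign is reported when the mean similarity to the last few documents falls below a threshold. The names `tag_paths`, `jaccard`, and `detect_redesigns`, the window size, and the threshold value are all hypothetical and do not correspond to any of the four approaches evaluated in the paper.

```python
from collections import deque
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of one HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the structural fingerprint (assumed here: a set of tag paths)."""
    collector = _TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def detect_redesigns(stream, window=3, threshold=0.5):
    """Return stream indices at which a redesign is suspected.

    A document is flagged when its mean similarity to the previous
    `window` documents drops below `threshold` (both values are
    illustrative assumptions, not tuned parameters).
    """
    recent = deque(maxlen=window)
    hits = []
    for index, html in enumerate(stream):
        paths = tag_paths(html)
        if recent:
            mean_sim = sum(jaccard(paths, p) for p in recent) / len(recent)
            if mean_sim < threshold:
                hits.append(index)
                recent.clear()  # the new template becomes the reference
        recent.append(paths)
    return hits


OLD = "<html><body><table><tr><td><p>{}</p></td></tr></table></body></html>"
NEW = "<html><body><div><article><h1>t</h1><p>{}</p></article></div></body></html>"
stream = [OLD.format(i) for i in range(3)] + [NEW.format(i) for i in range(3)]
print(detect_redesigns(stream))
```

Clearing the window after a detection is one possible design choice: it keeps the documents of the old template from dragging down the similarity of later documents that already follow the new template, so a single redesign is reported once rather than repeatedly.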
Similar resources
Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting
With due respect to the authors’ rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. This issue is considered a serious one in higher academic institutions. There exist language-free tools which do not yield reliable results since the special features of every language are ignored in them. Considering the paucit...
Evaluation of Similarity Measures for Template Matching
Image matching is a critical process in various photogrammetry, computer vision and remote sensing applications such as image registration, 3D model reconstruction, change detection, image fusion, pattern recognition, autonomous navigation, and digital elevation model (DEM) generation and orientation. The primary goal of the image matching process is to establish the correspondence between two ...
Spatial Semantic Scan: Detecting Subtle, Spatially Localized Events in Text Streams
Many methods have been proposed for detecting emerging events in text streams using topic modeling. However, these methods have shortcomings that make them unsuitable for rapid detection of locally emerging events on massive text streams. We describe Spatially Compact Semantic Scan (SCSS) that has been developed specifically to overcome the shortcomings of current methods in detecting new spati...
An Automated Approach to Categorize the Web Documents through Text Mining
With increased access to the internet, it has become obvious for all small and big organizations to have an efficacious web presence to acquaint users with the identity of the enterprise. Nowadays, the daily routine work of large organizations, such as communication, document distribution, and tender declarations (notices, circulars, etc.), is done via websites. Web pages of a website are divided i...
HTHquery: a method for detecting DNA-binding proteins with a helix-turn-helix structural motif
SUMMARY HTHquery is a web-based service to determine if a protein structure has a helix-turn-helix structural motif which could bind to DNA. It is based on a similarity with a set of structural templates, the accessibility of a putative structural motif and a positive electrostatic potential in the neighbourhood of the putative motif. A set of scores are computed, based on each template, using ...